We consider a component of the word statistics known as clump; starting from a finite set of words, clumps are maximal sets of overlapping occurrences of these words. This parameter was first studied by Schbath [22] with the aim of counting the number of occurrences of words in random texts. Later work with a similar probabilistic approach used the Chen-Stein approximation for a compound Poisson distribution, where the number of clumps follows a law close to Poisson. Presently there is no combinatorial counterpart to this approach, and we fill the gap here. We emphasize that, in contrast with the probabilistic approach which only provides asymptotic results, the combinatorial approach provides exact results that are useful when considering short sequences.
1 Introduction



Counting words and motifs in random texts has been studied extensively, for both theoretical and practical reasons. Much of the present combinatorial research builds on the work of Guibas and Odlyzko [10, 11], who defined the autocorrelation polynomial of a word. As an apparently surprising consequence of their work, the waiting time for the first occurrence of the word 111 in a Bernoulli string with probability 1/2 for zeroes and ones is larger than the waiting time for the first occurrence of the word 100. This is due to the fact that the occurrences of 111 come in clumps of ones, the probability of extending a clump by one position being 1/2; this implies that the average number of occurrences of 111 in a clump is larger than one. In contrast, there is only one occurrence of 100 in each clump of 100. Since the words 111 and 100 both start at a given position with probability 1/8, the interarrival time of clumps of 111 is larger than the interarrival time of clumps of 100.
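The comparison between 111 and 100 can be checked numerically. The sketch below is an illustration, not part of the original analysis: it relies on the classical identity $E[T] = C(1)/P(w)$ for the expected waiting time of a word $w$ in a Bernoulli text, where $C$ is the autocorrelation polynomial of $w$. The function name `expected_waiting_time` and the dictionary `p` of letter probabilities are naming choices made here.

```python
from fractions import Fraction

def expected_waiting_time(w, p):
    """Expected number of letters read until the first occurrence of w ends.

    Uses E[T] = C(1) / P(w), where C(1) sums the probabilities of the
    words of the autocorrelation set of w (Bernoulli source assumed).
    """
    m = len(w)

    def prob(s):
        out = Fraction(1)
        for ch in s:
            out *= p[ch]
        return out

    # For each border of length b (prefix of w equal to a suffix of w),
    # the autocorrelation set contains the word w[b:].
    c_at_1 = sum(prob(w[b:]) for b in range(1, m + 1) if w[:b] == w[m - b:])
    return c_at_1 / prob(w)
```

With fair bits, `expected_waiting_time("111", ...)` evaluates to 14 and `expected_waiting_time("100", ...)` to 8, in line with the discussion above.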

We analyze in this article several statistics connected to clumps of one word or of a reduced set of words. Our approach is based on properties of the Regnier-Szpankowski [18] decomposition of languages along occurrences of the considered word or set of words and on properties of the prefix codes generating the clumps. We provide explicit generating functions in the Bernoulli model for statistics such as (i) the number of clumps, (ii) the number of k-clumps, (iii) the number of positions of the texts covered by clumps, and (iv) the size of clumps in infinite texts; these results may be extended to a Markov model, at the cost of some technicalities. We also consider in the Bernoulli model an algorithmic approach where we construct deterministic finite automata recognizing clumps. This approach extends directly to the Markov model, and we obtain as a direct consequence a Gaussian limit law for the number of clumps in random texts.

Consider a rough first approximation for clumps of one word. If the probability of occurrence of a word $w$ is small, the probability of clumps $\mathfrak{K}$ of this word is small. This implies that the number of clumps in texts of size $n$ approximately follows a Poisson law of parameter $\lambda = n \times P(\text{a clump starts at position } i)$, where $i$ is a random position. Approximating further, the random number of occurrences $\mu$ of the word $w$ in a clump follows a geometric law with parameter $\omega$, where $\omega$ is the probability of self-overlap of the word.



Schbath and Reinert [19] obtained in the Markov case of any order a compound Poisson limit law for the count of the number of occurrences by the Chen-Stein method. See Reinert et al. [?] for a review and Barbour et al. [1] for an extensive introduction to the Poisson approximation. Schbath [22] gives the first moment of the number of k-clumps and of the number of clumps in Bernoulli texts. Recently, Stefanov et al. [24] used a stopping time method to compute the distribution of clumps; their results are not explicit and practical application of their method requires the inversion of a probability generating function.

We describe in Section 2 our notations and the Regnier-Szpankowski language decomposition. Sections 3.2 and 3.3 respectively provide our analysis in the case of counting clumps and k-clumps of one word and of a finite set of words. We prove by an automaton construction a normal limit law for the number of clumps in Section 5.

2 Preliminaries 

We consider a finite alphabet $\mathcal{A}$. Unless explicitly stated when considering a Markov source, the texts are generated by a non-uniform Bernoulli source over the alphabet $\mathcal{A}$. Given a set of words, clumps of these words may be seen as a generalization of runs of one letter.

Clumps and k-clumps. When considering a reduced set of words $U = \{u_1, \dots, u_r\}$ where each word $u_i$ has size at least 2, a clump is a maximal set of occurrences of words of $U$ such that

• any two consecutive letters of the clump belong to (are a factor of) at least one occurrence,

• either the clump is composed of a single occurrence that overlaps no other occurrence, or each occurrence overlaps at least one other occurrence.

This definition naturally applies also to the case where $U$ is composed of a single word.

As an example, considering the set $U = \{aba, bba\}$ and the text $T = bbbabababababbbbabaababb$, we have

T = b [bbababababa] bb [bbaba] [aba] bb

where the clumps are enclosed in brackets. The word bbababababa beginning at the second position of the text is a clump, and so are the words bbaba and aba beginning at the 15th and 20th positions. On the contrary, the word ababa beginning at the sixth position is not a clump since it is not maximal; neither is the word bbabaaba beginning at the 15th position a clump, since its two-letter factor aa is neither a factor of an occurrence of aba nor of an occurrence of bba.
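The example above can be reproduced with a few lines of code. The following sketch is only an illustration of the definition (the function name `find_clumps` and the returned triple format are choices made here): it lists all occurrences of the words of a reduced set, then merges occurrences that share at least one position.

```python
def find_clumps(text, words):
    """Return the clumps of `words` in `text` as (start, clump, k) triples.

    `start` is the 1-based starting position, `clump` the covered factor
    and `k` the number of occurrences in the clump. The set of words is
    assumed to be reduced (no word is a factor of another one).
    """
    # All occurrences as (first index, last index), 0-based and inclusive.
    occurrences = sorted(
        (i, i + len(w) - 1)
        for w in words
        for i in range(len(text) - len(w) + 1)
        if text.startswith(w, i)
    )
    clumps = []
    for start, end in occurrences:
        if clumps and start <= clumps[-1][1]:
            # The occurrence overlaps the current clump: extend it.
            clumps[-1][1] = max(clumps[-1][1], end)
            clumps[-1][2] += 1
        else:
            # No shared position with the previous clump: open a new one.
            clumps.append([start, end, 1])
    return [(s + 1, text[s:e + 1], k) for s, e, k in clumps]
```

Applied to the text $T$ and the set $\{aba, bba\}$, the function returns the three clumps bbababababa, bbaba and aba at positions 2, 15 and 20, containing 5, 2 and 1 occurrences respectively.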

More formally, we use as an intermediate step clusters, following Goulden and Jackson [?].

Definition 1 (Clumps) A clustering-word for the set $U = \{u_1, \dots, u_r\}$ is a word $w \in \mathcal{A}^*$ such that any two consecutive positions in $w$ are covered by the same occurrence in $w$ of a word $u \in U$. The position $i$ of the word $w$ is covered by a word $u$ if $u = w[(j - |u| + 1) \dots j]$ for some $j \in \{|u|, \dots, n\}$ (with $n = |w|$) and $j - |u| + 1 \le i \le j$. A cluster of a clustering-word $w$ in $K_U$ is a set of occurrence positions subsets $\{ S_u \subset \mathrm{Occ}(u, w) \mid u \in U \}$ which covers exactly $w$, that is, every two consecutive positions $i$ and $i + 1$ in $w$ are covered by at least one same occurrence of some $u \in U$. More formally

$$\forall i \in \{1, \dots, |w| - 1\} \quad \exists u \in U,\ \exists p \in S_u \ \text{such that}\ p - |u| + 1 < i + 1 \le p.$$

A clump, generically denoted here by $\mathfrak{K}$, is a maximal cluster in the sense that there exists no occurrence of the set $U$ that overlaps the corresponding clustering word without being a factor of it.

Note that a single word is a cluster and that, as mentioned previously, a clump may be composed of a single word.

A k-clump of occurrences of $U$ (denoted by $\mathfrak{K}^{(k)}$) is a clump containing exactly $k$ occurrences of $U$.

We aim here at providing explicit analytic formulas for the moments of the number of clumps, the total size of text covered by clumps or the number of clumps with exactly $k$ occurrences.
Notations. We consider the residual language $\mathcal{D} = \mathcal{L}.w^{-}$ as $\mathcal{D} = \{x,\ x.w \in \mathcal{L}\}$.

In case of ambiguity, we will use a bracket notation $\{\mathcal{L}\}(z, \dots)$ to represent the generating function of the language $\mathcal{L}$; in particular, for $\mathcal{D} = \mathcal{L}.w^{-}$, we write $\{\mathcal{L}.w^{-}\}(z, \dots) = D(z, \dots)$.

Considering two languages $\mathcal{L}_1$ and $\mathcal{L}_2$, if we have $\mathcal{L}_1 \subset \mathcal{L}_2$, we write $\mathcal{L}_2 - \mathcal{L}_1 = \mathcal{L}_2 \setminus \mathcal{L}_1$ as the difference of sets.

Reduced set of words. A set of words $U = \{u_1, \dots, u_r\}$ is reduced if no $u_i$ is a factor of a $u_j$ with $i$ different from $j$.

Autocorrelations, correlations and right extension sets of words. We recall here the definition of the Right Extension Set introduced in Bassino et al. [?].
The right extension set of a pair of words $(h_1, h_2)$ is

$$\mathcal{E}_{h_1,h_2} = \{\, e \mid \text{there exists } e' \in \mathcal{A}^{+} \text{ such that } h_1 e = e' h_2 \text{ with } 0 < |e| < |h_2| \,\}.$$

If the word $h_1$ is not a factor of $h_2$, this extension set of $h_1$ to $h_2$ is the usual correlation set $\mathcal{C}_{h_1,h_2}$ of $h_1$ and $h_2$.

When we have $h_1 = h_2 = h$, we get the autocorrelation set $\mathcal{C}_{h,h}$ of the word $h$, which we denote further by $\mathcal{C}$ when there is no ambiguity.

We also write $\mathcal{C}_0 = \mathcal{C} - \epsilon$. Remark that $\mathcal{C}_0$ is empty if the word $w$ has no autocorrelation.

We remark here that the empty word $\epsilon$ belongs to the autocorrelation set of a word. Note also that the correlation set of two words may be empty.

We have as examples

$$\mathcal{C}_{aabaa,aab} = \{b, ab\}, \qquad \mathcal{C}_{ababa,ababa} = \{\epsilon, ba, baba\}.$$
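The two examples can be recomputed mechanically. The sketch below is an illustration under one stated assumption: following the convention of the examples, the empty word is kept when the two words are equal (autocorrelation set), whereas the right extension set itself requires $0 < |e|$. The function name `correlation_set` is a choice made here.

```python
def correlation_set(h1, h2):
    """Words e such that h1.e = e'.h2 with |e| < |h2|, by increasing length.

    A suffix of h1 of length `overlap` must equal the prefix of h2 of the
    same length; e is then the rest of h2. For h1 == h2 the empty word is
    included (overlap == len(h2)), as in the autocorrelation set.
    """
    result = []
    for k in range(len(h2)):            # k = |e|
        overlap = len(h2) - k
        if overlap <= len(h1) and h1.endswith(h2[:overlap]):
            result.append(h2[overlap:])
    return result
```

For a reduced pair of distinct words the empty word never appears, so the function then returns exactly the right extension set defined above.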

Generating functions. We aim at computing the number of occurrences of a given object in random texts by use of generating functions such as

$$L_v(z, x) = \sum_{T \in \mathcal{L}} P(T)\, z^{|T|} x^{|T|_v} = \sum_{n \ge 0} \sum_{i \ge 0} l_{n,i}\, x^i z^n, \qquad (1)$$

where $|T|_v$ is the number of occurrences of the object $v$ in the text $T$ and $l_{n,i}$ is the probability that a text of size $n$ has $i$ occurrences of this object. This extends naturally to counting more than one object by considering multivariate generating functions with several parameters.

If the random variable $X_n$ counts the number of objects in a text of size $n$, we get from Equation (1)

$$E(X_n) = [z^n] \left. \frac{\partial L_v(z,x)}{\partial x} \right|_{x=1}, \qquad E(X_n^2) = [z^n] \left. \frac{\partial}{\partial x}\, x\, \frac{\partial L_v(z,x)}{\partial x} \right|_{x=1}.$$

Recovering exactly or asymptotically these moments follows then from classical methods.
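For very short texts the coefficients $l_{n,i}$ can be obtained by exhaustive enumeration, which gives a simple way to test moment formulas. The sketch below is only such a sanity check, not the generating-function method itself; the names `occurrence_distribution` and `first_two_moments` are choices made here, and the object counted is the number of (possibly overlapping) occurrences of a single word.

```python
from fractions import Fraction
from itertools import product

def occurrence_distribution(w, p, n):
    """Return {i: l_{n,i}}: law of the number of occurrences of w in a
    Bernoulli text of size n, with letter probabilities given by p."""
    law = {}
    for letters in product(sorted(p), repeat=n):
        text = "".join(letters)
        weight = Fraction(1)
        for ch in letters:
            weight *= p[ch]
        count = sum(text.startswith(w, i) for i in range(n - len(w) + 1))
        law[count] = law.get(count, Fraction(0)) + weight
    return law

def first_two_moments(law):
    """E(X_n) and E(X_n^2), i.e. the two derivatives at x = 1."""
    mean = sum(i * q for i, q in law.items())
    second = sum(i * i * q for i, q in law.items())
    return mean, second
```

For the word aba over a uniform binary alphabet and $n = 5$, the first moment equals $(n - |w| + 1)\pi_w = 3/8$.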

3 Formal language approach 

3.1 Regnier and Szpankowski decomposition

Since our work extends the formal language approach of Regnier and Szpankowski [18], we recall it here. Considering one word $w$, Regnier and Szpankowski use a natural parsing or decomposition of texts with at least one occurrence of $w$, where

• there is a first occurrence at the right extremity of a "subtext", the set of which constitutes the Right language,

• possibly followed by other occurrences, which are separated by "subtexts" that constitute the Minimal language,

• and completed by "subtexts" that provide no other occurrences, which constitute the Ultimate language.

Moreover, there is a language without any match of the considered word $w$. Regnier [17] further extended this approach to a reduced set of words.

We follow here the book of Lothaire [?] (Chapter 7), which presents their method.

We consider a set of words $V = \{v_1, \dots, v_r\}$. We have, formally

Definition 2 (Right, Minimal, Ultimate and Not languages)

• The "Right" language $\mathcal{R}_i$ associated to the word $v_i$ is the set of words

$$\mathcal{R}_i = \{\, r \mid r = e.v_i \ \text{and}\ \nexists e' \in V,\ r = x e' y,\ |y| > 0 \,\}.$$

• The "Minimal" language $\mathcal{M}_{ij}$ leading from a word $v_i$ to a word $v_j$ is the set of words

$$\mathcal{M}_{ij} = \{\, m \mid v_i.m = e.v_j \ \text{and}\ \nexists e' \in V,\ v_i.m = x e' y,\ |x| > 0,\ |y| > 0 \,\}.$$

• The "Ultimate" language completing a text after an occurrence of the word $v_i$ is the set of words

$$\mathcal{U}_i = \{\, u \mid \nexists e \in V,\ v_i.u = x e y,\ |x| > 0 \,\}.$$

• The "Not" language of texts without any occurrence of a word of $V$ is the set of words

$$\mathcal{N} = \{\, n \mid \nexists e \in V,\ n = x e y \,\}.$$

The notations $\mathcal{R}$, $\mathcal{M}$, $\mathcal{U}$ and $\mathcal{N}$ refer here to the Right, Minimal, Ultimate and Not languages of a single word.

Consider as an example the word $w = ababa$; in the following texts, the underlined words belong to the set $\mathcal{M}$; the overlined text does not since the word represented in bold face is an intermediate occurrence.

ababa aaaaababa ababababbbbbbbb abababa.

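Considering the matrix $\mathbb{M}$ such that $\mathbb{M}_{ij} = \mathcal{M}_{ij}$, we have

$$\bigcup_{k \ge 1} (\mathbb{M}^k)_{i,j} = \mathcal{A}^* \cdot v_j + \mathcal{C}_{ij} - \delta_{ij}\epsilon, \qquad \mathcal{U}_i \cdot \mathcal{A} = \bigcup_j \mathcal{M}_{ij} + \mathcal{U}_i - \epsilon, \qquad (2)$$

$$\mathcal{A} \cdot \mathcal{R}_j - (\mathcal{R}_j - v_j) = \bigcup_i v_i \mathcal{M}_{ij}, \qquad \mathcal{N} \cdot v_j = \mathcal{R}_j + \bigcup_i \mathcal{R}_i (\mathcal{C}_{ij} - \delta_{ij}\epsilon). \qquad (3)$$

If the size of the texts is counted by the variable $z$ and the occurrences of the words $v_1, \dots, v_r$ are counted respectively by $x_1, \dots, x_r$, we get the matrix equation

$$F(z, x_1, \dots, x_r) = N(z) + (x_1 R_1(z), \dots, x_r R_r(z)) \left( \mathbb{I} - \mathbb{M}(z, x_1, \dots, x_r) \right)^{-1} \begin{pmatrix} U_1(z) \\ \vdots \\ U_r(z) \end{pmatrix}. \qquad (4)$$

In this last equation, we have $\mathbb{M}_{ij}(z, x_1, \dots, x_r) = x_j M_{ij}(z)$ and the generating functions $R_i(z)$, $M_{ij}(z)$, $U_j(z)$ and $N(z)$ can be computed explicitly from the set of Equations (2, 3).

In particular, when considering the Bernoulli weighted case $A(z) = z$ and a single word $w$ with $\pi_w = P(w)$, we have the set of equations

$$R(z) = \frac{\pi_w z^{|w|}}{D(z)}, \quad M(z) = 1 + \frac{z - 1}{D(z)}, \quad U(z) = \frac{1}{D(z)}, \quad N(z) = \frac{C(z)}{D(z)}, \quad \text{with } D(z) = \pi_w z^{|w|} + (1 - z)C(z), \qquad (5)$$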
$$\mathcal{A}^* = \mathcal{N} + \mathcal{R}\mathcal{M}^*\mathcal{U} \implies F(z,x) = \frac{1}{1 - z + \pi_w z^{|w|}\, \dfrac{1 - x}{x + (1 - x)C(z)}} = \sum_{n,k} f_{n,k}\, x^k z^n. \qquad (6)$$

In this last equation, $f_{n,k}$ is the probability that a text of size $n$ has $k$ occurrences of $w$.
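Setting $x = 0$ in Equation (6) gives $N(z) = C(z)/D(z)$, whose coefficients are the probabilities that a text of size $n$ contains no occurrence of $w$. The sketch below expands this rational function as a power series with exact fractions and compares it with direct enumeration. It is an illustration only; all function names are choices made here, and a Bernoulli source with letter probabilities `p` is assumed.

```python
from fractions import Fraction
from itertools import product

def word_prob(s, p):
    out = Fraction(1)
    for ch in s:
        out *= p[ch]
    return out

def autocorrelation_poly(w, p):
    """Coefficients [c_0, ..., c_{|w|-1}] of C(z) = sum of P(c) z^{|c|} over c in C."""
    m = len(w)
    coeffs = [Fraction(0)] * m
    for b in range(1, m + 1):             # b = length of a border of w
        if w[:b] == w[m - b:]:
            coeffs[m - b] += word_prob(w[b:], p)
    return coeffs

def no_occurrence_probs(w, p, nmax):
    """[z^n] N(z) for n = 0..nmax, with N(z) = C(z) / D(z)."""
    m = len(w)
    C = autocorrelation_poly(w, p)
    D = [Fraction(0)] * (m + 1)           # D(z) = pi_w z^m + (1 - z) C(z)
    for k, c in enumerate(C):
        D[k] += c
        D[k + 1] -= c
    D[m] += word_prob(w, p)
    N = []
    for n in range(nmax + 1):             # power-series division, D[0] = 1
        s = C[n] if n < m else Fraction(0)
        for k in range(1, min(n, m) + 1):
            s -= D[k] * N[n - k]
        N.append(s / D[0])
    return N

def brute_force_no_occurrence(w, p, n):
    """Same probability obtained by enumerating all texts of size n."""
    return sum(word_prob("".join(t), p)
               for t in product(sorted(p), repeat=n)
               if w not in "".join(t))
```

For the word 111 with fair bits, the coefficient of $z^3$ is $7/8$, as expected since only the text 111 itself is excluded.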

3.2 Clump analysis for one word 

The decomposition of Regnier and Szpankowski is based on a parsing by the occurrences of the considered words. We use a similar approach, but parse with respect to the occurrences of clumps. As a major difference, when they consider the minimal language separating two occurrences, these two occurrences may overlap; in contrast, by definition, overlapping of clumps is forbidden.

A clump of the word $w$ is basically defined as $w\mathcal{C}^*$, since any element of $\mathcal{C}_0$ concatenated to a cluster extends this cluster.

Considering the word $w = aaa$, we have $\mathcal{C} = \{\epsilon, a, aa\}$ and $\mathcal{C}^*$ is ambiguous. We can however generate $\mathcal{C}^*$ unambiguously as described in the next section.

3.2.1 A prefix code $\mathcal{K}$ to generate unambiguously $\mathcal{C}^*$

Since $\mathcal{C}_0$ is a finite language, it is possible to find a prefix code $\mathcal{K}$ generating $\mathcal{C}_0$; moreover, for $c_1, c_2 \in \mathcal{C} - \epsilon$ and $|c_1| < |c_2|$, the word $c_1$ is a proper suffix of $c_2$. Otherwise stated, the prefix code $\mathcal{K} = \{\kappa_1, \dots, \kappa_k\}$ is built over words $q_1, q_2, \dots, q_k$ and may be written as $\mathcal{K} = \{q_1, q_2 q_1, \dots, q_k q_{k-1} \cdots q_1\}$.

We refer to Berstel and Perrin [?] for an introduction to prefix codes. See also Berstel [?] for an analysis of counts of words of the pattern $U$ by semaphore codes $U - \mathcal{A}^* U \mathcal{A}^+$. We have the following lemma.

Lemma 1 The prefix code $\mathcal{K} = \mathcal{C}_0 \setminus \mathcal{C}_0\mathcal{A}^+$ generates unambiguously the language $\mathcal{C}^*$.

Proof: It is clear that $\mathcal{K}$ is prefix. Consider $c \in \mathcal{C}_0 - \mathcal{K}$ if this last set is not empty. Since $\mathcal{K}$ is the set of words of $\mathcal{C}_0$ without any proper prefix in $\mathcal{C}_0$, we have a contrario $c = u.v$ with $u$ and $v$ non-empty and in $\mathcal{C}_0$. We have $|u| < |c|$ and $|v| < |c|$; if $u$ or $v$ does not belong to $\mathcal{K}$, we may iterate the process on the corresponding word. Since $|c|$ is finite, after a finite number of steps, we get a decomposition $c = \kappa_{i_1}\kappa_{i_2} \cdots \kappa_{i_j}$ where each $\kappa_i$ is in $\mathcal{K}$. Since $\mathcal{K}$ is a code, the decomposition of each word of $\mathcal{C}$ over $\mathcal{K}$ is unique and so is the decomposition of any word of $\mathcal{C}^*$. □

Example 1 Let $w = abaabaaba$. Writing each border of $w$ followed, after a bar, by the corresponding right extension, we have

abaabaaba | ε
abaaba | aba
aba | abaaba
a | baabaaba

$$\implies \mathcal{C} = \{\epsilon, aba, abaaba, baabaaba\} \implies \mathcal{K} = \{aba, baabaaba\}.$$

The periods of a word $w$ are the integers $\{|h|,\ h \in \mathcal{C}_0\}$; the irreducible periods form the subset of periods from which all the periods may be deduced. As follows from Guibas and Odlyzko [10] and Rivals and Rahmann [21], when considering the word ababaccababa, the irreducible periods are 7 and 9, while the period 11 can be deduced from the periods 7 and 9. However, we have here $\mathcal{K} = \mathcal{C}_0 = \{ccababa, baccababa, babaccababa\}$, which implies, somewhat against intuition, that in general there is no bijection between the irreducible periods and the prefix code of a word.
Constructing the prefix code $\mathcal{K}$. We use the following algorithm:

1. start with the word $w$;

2. shift $w$ to the right to the first self-overlapping position; let $\kappa_1$ be the trailing suffix so obtained; insert it in a trie;

3. repeat shifting, obtaining new trailing suffixes; for each new suffix generated, try an insertion in the trie. If you reach a leaf, drop the suffix; otherwise insert it.

The worst case complexity for this construction is $O(|w|)$, but the average complexity is $O(|\mathcal{K}| \log(|\mathcal{K}|))$, the average path length of a trie built over $|\mathcal{K}|$ keys.
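The three steps above can be sketched as follows. This is a simplified illustration and not the trie-based construction itself: the trie is replaced by a plain list of the words already kept, and each new trailing suffix is tested against them with `str.startswith`, which gives the same code at a higher cost. The function names are choices made here.

```python
def autocorrelation_set(w):
    """Autocorrelation set C of w, by increasing length (empty word first).

    Shifting w to the right until it overlaps itself amounts to scanning
    the borders of w from the longest to the shortest.
    """
    m = len(w)
    return [w[b:] for b in range(m, 0, -1) if w[:b] == w[m - b:]]

def prefix_code(w):
    """K = C0 minus C0.A+: the words of C0 with no proper prefix in C0."""
    code = []
    for c in autocorrelation_set(w)[1:]:     # skip the empty word
        # Reaching a leaf of the trie means that a kept word is a prefix of c.
        if not any(c.startswith(k) for k in code):
            code.append(c)
    return code
```

On the words of this section, the function returns {aba, baabaaba} for abaabaaba, {a} for aaa, and the whole set $\mathcal{C}_0$ for ababaccababa.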

3.2.2 The language decomposition 

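Considering the word $w = aaaaa$, we have $\mathcal{C} = \{\epsilon, a, aa, aaa, aaaa\}$ and $\mathcal{K} = \{a\}$. Moreover, we have $\mathcal{M} = \{a\} + b(b + ab + aab + aaab + aaaab)^* aaaaa$. We get here $\mathcal{K} \subset \mathcal{M}$ and $\mathcal{M} - \mathcal{K} = \mathcal{L}w$ with $\mathcal{L} = b(b + ab + aab + aaab + aaaab)^*$. The languages $\mathcal{M}$ and $\mathcal{K}$ are indeed connected by a simple property that we describe now.

Lemma 2 For any word $w$ with autocorrelation set $\mathcal{C}$, prefix code $\mathcal{K}$ generating $\mathcal{C}^*$ and minimal language $\mathcal{M}$, there exists a non-empty language $\mathcal{L}$ such that

$$\mathcal{K} \subset \mathcal{M} \quad \text{and} \quad \mathcal{M} - \mathcal{K} = \mathcal{L}w. \qquad (7)$$

Proof: We have $\mathcal{K} \subset \mathcal{C}$ and $\mathcal{K} \subset \mathcal{M}$; therefore, we have $\mathcal{K} \subset \mathcal{M} \cap \mathcal{C}$. We prove that if $c \in \mathcal{C} - \mathcal{K}$ then $c \notin \mathcal{M}$. Let us suppose that $c \ne \epsilon$ and $c \in \mathcal{C} - \mathcal{K}$. This implies that $c \in \mathcal{K}\mathcal{A}^+$ by definition of $\mathcal{K}$. Therefore, we have $c = \kappa v$ with $\kappa \in \mathcal{K}$ and $|v| > 0$. As a consequence, $c$ cannot belong to the minimal language $\mathcal{M}$, the word $\kappa$ corresponding to a previous occurrence of $w$. Hence $\mathcal{M} \cap \mathcal{C} = \mathcal{K}$, and every word of $\mathcal{M} - \mathcal{K}$ has length at least $|w|$ and therefore ends with $w$. □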
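Lemma 2 can be checked by brute force on short words. The sketch below enumerates the words of the minimal language up to a given length and compares those shorter than $w$ with the prefix code. It is an illustration only; `minimal_words` and `prefix_code` are names chosen here, and the enumeration is exponential in the length bound.

```python
from itertools import product

def prefix_code(w):
    """Words of C0 (non-empty autocorrelations of w) with no proper prefix in C0."""
    m = len(w)
    c0 = [w[b:] for b in range(m - 1, 0, -1) if w[:b] == w[m - b:]]
    code = []
    for c in c0:
        if not any(c.startswith(k) for k in code):
            code.append(c)
    return code

def minimal_words(w, alphabet, max_len):
    """Words m with 1 <= |m| <= max_len such that w.m ends with w and
    contains no occurrence of w besides the first and the last one."""
    out = []
    for n in range(1, max_len + 1):
        for letters in product(alphabet, repeat=n):
            text = w + "".join(letters)
            hits = [i for i in range(n + 1) if text.startswith(w, i)]
            if hits == [0, n]:
                out.append("".join(letters))
    return out
```

For $w = aaaaa$ and lengths up to 7, the enumeration yields a, baaaaa and bbaaaaa: the only word shorter than $w$ is the element of $\mathcal{K}$, and the others end with $w$.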

This leads immediately to the fundamental lemma. 

Lemma 3 The basic equation for the combinatorial decomposition of texts on the alphabet $\mathcal{A}$, where $v$ counts some object in the clump of a word $w$, is

$$\mathcal{A}^* = \mathcal{N} + \mathcal{R}w^{-}(w\mathcal{C}^*)_{v} \left( (\mathcal{M} - \mathcal{K})\, w^{-} (w\mathcal{C}^*)_{v} \right)^* \mathcal{U}. \qquad (8)$$

Proof: Equation (8) follows from the parsing

• either there is no occurrence of $w$, which gives the Not language $\mathcal{N}$,

• or

1. we read until the first occurrence: $\mathcal{R}w^{-}w$,

2. followed by any number of overlapping occurrences of $w$ (a clump less the first occurrence): $\mathcal{C}^*$,

3. followed by any number of

(a) next occurrence of $w$ without overlap: $(\mathcal{M} - \mathcal{K})w^{-}w$,

(b) and any number of overlapping occurrences of $w$: $\mathcal{C}^*$,

4. and completed by a text providing no further occurrence of $w$: $\mathcal{U}$.

□
We can now use the preceding lemma to count several parameters related to the clumps.
3.2.3 Counting parameters related to the clumps

Let $\mathfrak{K}(z, x, t)$ be the generating function of the clumps, where the variable $x$ counts the number of occurrences of $w$ in a clump and the variable $t$ counts the size of the clumps; the variable $z$ is used here to count the total length of the texts. We also use a variable $u$ to count the number of clumps. We have the following theorem.

Theorem 1 In the weighted model such that $A(z) = z$, the generating function counting the number of occurrences of a word $w$ and the number of positions covered by the clumps of $w$ verifies

$$F(z, \mathfrak{K}(z,x,t)) = N(z) + \frac{R(z)}{\pi_w z^{|w|}}\, \mathfrak{K}(z,x,t)\, \frac{1}{1 - \dfrac{M(z) - K(z)}{\pi_w z^{|w|}}\, \mathfrak{K}(z,x,t)}\, U(z), \qquad (9)$$

where the generating function of the clumps verifies

$$\mathfrak{K}(z,x,t) = x\, \pi_w (zt)^{|w|}\, \frac{1}{1 - x K(zt)}. \qquad (10)$$

As a consequence, the generating function counting also the number of clumps is

$$G(z,x,t,u) = F(z, u\,\mathfrak{K}(z,x,t)). \qquad (11)$$

Proof: This theorem follows from Lemma 1 and from a direct translation of Equation (8) into generating functions. □

3.2.4 Occurrences of clumps

Considering $F(z, u\,\mathfrak{K}(z,1,1))$ in Equation (9) and using Equation (10) provides the generating function

$$O(z,u) = \sum_{n,i} o_{n,i}\, u^i z^n = N(z) + \frac{u\, R(z) U(z)}{1 - u M(z) + (u - 1) K(z)}, \qquad (12)$$

where $o_{n,i}$ is the probability of getting $i$ clumps (of any size) in a text of size $n$. Considering $r_n$, the expectation of the number of clumps in texts of size $n$, we get by differentiation

$$\sum_{n} r_n z^n = \frac{R(z)U(z)(1 - K(z))}{(1 - M(z))^2} = \frac{\pi_w z^{|w|}(1 - K(z))}{(1 - z)^2}.$$

Since $[z^m]\, (1 - K(z))/(1 - z)^2 = (m + 1)(1 - K(1)) + K'(1)$ as soon as $m$ is at least the degree of $K$, this implies that $r_n = (n - |w| + 1)\pi_w(1 - K(1)) + \pi_w K'(1)$ for $n \ge 2|w| - 1$, to be compared with the expectation $(n - |w| + 1)\pi_w$ of the number of occurrences of the word $w$.
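The expectation $r_n = (n - |w| + 1)\pi_w(1 - K(1)) + \pi_w K'(1)$, obtained by extracting the coefficient of $z^n$ in $\pi_w z^{|w|}(1 - K(z))/(1 - z)^2$, can be compared with an exhaustive computation on short texts. The sketch below is such a check under the Bernoulli model; the function names are choices made here, and the closed form is only evaluated for $n \ge 2|w| - 1$.

```python
from fractions import Fraction
from itertools import product

def count_clumps(text, w):
    """Number of clumps of w in text: an occurrence opens a new clump
    when it shares no position with the previous occurrence."""
    count, last_end = 0, -1
    for i in range(len(text) - len(w) + 1):
        if text.startswith(w, i):
            if i > last_end:
                count += 1
            last_end = i + len(w) - 1
    return count

def word_prob(s, p):
    out = Fraction(1)
    for ch in s:
        out *= p[ch]
    return out

def expected_clumps_brute(w, p, n):
    """E(number of clumps) by enumeration of all texts of size n."""
    return sum(word_prob("".join(t), p) * count_clumps("".join(t), w)
               for t in product(sorted(p), repeat=n))

def expected_clumps_formula(w, p, n):
    """(n - |w| + 1) pi_w (1 - K(1)) + pi_w K'(1), for n >= 2|w| - 1."""
    m = len(w)
    c0 = [w[b:] for b in range(m - 1, 0, -1) if w[:b] == w[m - b:]]
    code = []                                   # prefix code K of w
    for c in c0:
        if not any(c.startswith(k) for k in code):
            code.append(c)
    k_at_1 = sum(word_prob(k, p) for k in code)             # K(1)
    dk_at_1 = sum(len(k) * word_prob(k, p) for k in code)   # K'(1)
    return word_prob(w, p) * ((n - m + 1) * (1 - k_at_1) + dk_at_1)
```

For the word aba over a uniform binary alphabet, both computations give $11/32$ for $n = 5$.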

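3.2.5 Occurrences of k-clumps

By considering the equation of a clump of occurrences of $w$, we can write

$$w\mathcal{C}^* = w + w\mathcal{K} + w\mathcal{K}^2 + \dots + (v - 1 + 1)\, w\mathcal{K}^{k-1} + \dots$$

to count clumps with exactly $k$ occurrences of $w$.

Writing $\mathfrak{K}^{(k)}(z, v)$ the generating function which counts with the variable $z$ the size of the clumps and where the variable $v$ selects k-clumps, we have

$$\mathfrak{K}^{(k)}(z, v) = \pi_w z^{|w|} \left( \frac{1}{1 - K(z)} + (v - 1) K(z)^{k-1} \right).$$

Substituting this in Equation (9) gives

$$O^{(k)}(z,v) = \sum_{n,i} o^{(k)}_{n,i}\, v^i z^n = N(z) + \frac{R(z)}{\pi_w z^{|w|}}\, \mathfrak{K}^{(k)}(z,v)\, \frac{1}{1 - \dfrac{M(z) - K(z)}{\pi_w z^{|w|}} \times \mathfrak{K}^{(k)}(z,v)}\, U(z),$$

where $o^{(k)}_{n,i}$ is the probability that a text of size $n$ contains exactly $i$ k-clumps.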
3.2.6 Probability that a random position is covered by a clump

This follows from the knowledge of the number of positions of the texts covered by the clumps.

Let $P_n$ be the random variable counting the number of positions covered by the clumps of a word $w$ in texts of size $n$ and $\vartheta_n$ be the probability that a random position is covered by a clump in a text of size $n$.

Let $F(z, t)$ denote $F(z, \mathfrak{K}(z,1,t))$, with $F(z, \mathfrak{K})$ given by Equation (9); this is the generating function counting the size of the texts and the number of positions covered by clumps. We have

$$F(z,t) = \sum_{n \ge 0} \sum_{i \ge 0} P(P_n = i)\, t^i z^n, \qquad \vartheta_n = \frac{1}{n} E(P_n) = \frac{1}{n} [z^n] \left. \frac{\partial F(z,t)}{\partial t} \right|_{t=1}.$$

3.3 Clumps of a finite set of words

We provide in this section a matricial solution for counting clumps of a reduced finite set of words. For simplicity's sake we consider a set of two words $w_1$ and $w_2$, but our approach is amenable to any reduced finite set.

Similarly to the one word case, we are led to consider prefix codes generating the correlation of two words. Writing $\mathcal{C}_{ij}^*$ with $i \ne j$ makes no sense in terms of language decomposition. However, we can write as previously $\mathcal{K}_{ij} = \mathcal{C}_{ij} - \mathcal{C}_{ij}\mathcal{A}^+$, which defines a minimal correlation language with good properties.

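We have as examples

Example 2 Let $w_1 = aabaa$ and $w_2 = aaa$. We have $\mathcal{C}_{12} = \{a, aa\}$ and $\mathcal{K}_{12} = \{a\}$. In this case, we have $\mathcal{C}_{12} = \mathcal{C}_{22} - \{\epsilon\}$ and $\mathcal{K}_{12} = \mathcal{K}_{22}$.

Example 3 Let $w_1 = abab$ and $w_2 = baba$. We have $\mathcal{C}_{12} = \mathcal{K}_{12} = \{a, aba\}$. In this case, we have $\mathcal{C}_{22} = \{\epsilon, ba\}$ and $\mathcal{K}_{12} = a.\mathcal{K}_{22}$.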

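Following a proof similar to the proof of Lemma 2, there exists a language $\mathcal{L}$ such that

$$\mathcal{M}_{ij} - \mathcal{K}_{ij} = \mathcal{L} w_j.$$

We can therefore write a minimal correlation matrix $\mathbb{K}$, consider the matrix $\mathbb{S} = \mathbb{K}^*$ and write a clump matrix $\mathbb{G}$ as follows

$$\mathbb{K} = \begin{pmatrix} \mathcal{K}_{11} & \mathcal{K}_{12} \\ \mathcal{K}_{21} & \mathcal{K}_{22} \end{pmatrix}, \qquad \mathbb{S} = \mathbb{K}^*, \qquad \mathbb{G} = \begin{pmatrix} w_1 \mathbb{S}_{11} & w_1 \mathbb{S}_{12} \\ w_2 \mathbb{S}_{21} & w_2 \mathbb{S}_{22} \end{pmatrix}. \qquad (13)$$

In this equation, $\mathbb{G}_{ij}$ is a clump starting with the word $w_i$ and finishing with the word $w_j$. We obtain now a fundamental matricial decomposition that can be used for further analysis,

$$\mathcal{A}^* = \mathcal{N} + (\mathcal{R}_1 w_1^{-}, \mathcal{R}_2 w_2^{-})\, \mathbb{G} \left( (\mathbb{M} - \mathbb{K})^{-}\, \mathbb{G} \right)^* \begin{pmatrix} \mathcal{U}_1 \\ \mathcal{U}_2 \end{pmatrix},$$

where we have $(\mathbb{M} - \mathbb{K})^{-}_{ij} = (\mathcal{M}_{ij} - \mathcal{K}_{ij})\, w_j^{-}$.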



4 Automaton approach 



For a set $U = \{u_1, \dots, u_r\}$, we build a kind of "Aho-Corasick" automaton on the following set of words $X$

$$X = \{\, u_i \cdot w \mid 1 \le i \le r \ \text{and}\ w \in \{\epsilon\} \cup \mathcal{E}_{i,j} \ \text{for some } j \,\}.$$

The automaton $\mathcal{T}$ is built on $X$ with $Q = \mathrm{Pref}(X)$ (set of states) and $i = \epsilon$ (initial state). The transition function is defined (as in the Aho-Corasick construction) as

$$\delta(p, x) = \text{the longest suffix of } px \text{ in } \mathrm{Pref}(X).$$
In order to count the number of clumps (for instance), the set of final states $T$ needs more attention: it is defined as

$$T = X \setminus X\mathcal{A}^+.$$

This automaton accepts the language of words ending with the first occurrence of a word in a clump.

We can easily derive from this automaton the generating function $f(z, x_1, \dots, x_r, t, u)$ where $x_i$ marks an occurrence of $u_i$, $t$ marks the number of clumps, and $u$ the total length covered by the clumps. Indeed, one has to mark some transitions in the adjacency matrix $\mathbb{A}$ according to some simple rules.

• To count occurrences of the $u_i$'s, we have to mark with the formal variable $x_i$ the transitions going to states in $\mathcal{A}^* u_i \cap \mathrm{Pref}(X)$ (for $1 \le i \le r$).

• For the number of clumps, one can mark the transitions going to states in $U \setminus U\mathcal{A}^+ = X \setminus X\mathcal{A}^+$, that is, states corresponding to first occurrences inside a clump.

• Finally, for the total length covered by clumps, we have to put a formal weight on transitions going to a state $p \in \mathcal{A}^* U \cap \mathrm{Pref}(X)$, taking into account the number of symbols between the last occurrence of a word of $X$ and the new one at the end of $p$. Let us define for a state $p$ (corresponding to a word with an occurrence of some word of $X$ at the end) the function $\ell(p)$ as the maximal proper prefix $q$ of $p$ in $\mathcal{A}^* U$ if it exists, or $\epsilon$ if there is no such prefix. Then we must mark all transitions going to $p$ with $u^{|p| - |\ell(p)|}$ (if $p \in \mathcal{A}^* U \cap \mathrm{Pref}(X)$).

Of course the construction does not give a minimal automaton. However the automaton is complete and deterministic, so that the translation to generating functions is straightforward.

Example

1. For one word, $U = \{u = bababa\}$ and $\mathcal{E}_u = \{ba, baba\}$. The set $X$ is

$$X = \{bababa, babababa, bababababa\}.$$

N.B.: The sign '+' on the automaton indicates that the corresponding prefix ends with some occurrence of $U$. The double oval states indicate the states where we know we have entered a new clump.

2. For the set $U = \{u_1 = aabaa, u_2 = baab\}$, the matrix of right extension sets is

$$\mathbb{E} = \begin{pmatrix} baa + abaa & b \\ aa & aab \end{pmatrix}.$$

The set $X$ is

$$X = \{aabaa, aabaab, aabaabaa, aabaaabaa, baab, baabaa, baabaab\}.$$

We have the following automaton (with $x$ and $y$ marking occurrences respectively of $u_1$ and $u_2$). The automaton is complete and deterministic. However, for clarity's sake, all transitions labelled by a and b ending respectively on state A and B are omitted. As before, the sign '+' indicates that the corresponding prefix (or, equivalently, state) ends with some occurrence of $U$. The double oval attribute indicates the states where we know we have entered a new clump.
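The construction of this section can be sketched in a few lines. The code below is an illustration under stated assumptions, not the authors' implementation: states are represented directly by the words of $\mathrm{Pref}(X)$, the transition table is a dictionary, only the clump-counting marks are implemented, and the set $U$ is assumed to be reduced. All function names are choices made here; `agrees_with_direct_count` compares the automaton with a direct merge of overlapping occurrences on all short texts.

```python
from itertools import product

def right_extension_set(h1, h2):
    """E_{h1,h2}: words e with h1.e = e'.h2, 0 < |e| < |h2|, e' non-empty."""
    out = set()
    for k in range(1, len(h2)):               # k = |e|
        overlap = len(h2) - k                 # suffix of h1 = prefix of h2
        if overlap < len(h1) and h1.endswith(h2[:overlap]):
            out.add(h2[overlap:])
    return out

def build_automaton(words, alphabet):
    """Return (X, first, delta): the set X, the states X minus X.A+ marking
    the first occurrence of a clump, and the transition table on Pref(X)."""
    X = set(words)
    for ui in words:
        for uj in words:
            X |= {ui + e for e in right_extension_set(ui, uj)}
    states = {x[:k] for x in X for k in range(len(x) + 1)}       # Pref(X)
    first = {x for x in X if not any(x != y and x.startswith(y) for y in X)}
    delta = {}
    for p in states:
        for c in alphabet:
            s = p + c
            while s not in states:            # longest suffix of p.c in Pref(X)
                s = s[1:]
            delta[p, c] = s
    return X, first, delta

def count_clumps_automaton(text, words, alphabet):
    _, first, delta = build_automaton(words, alphabet)
    state, clumps = "", 0
    for c in text:
        state = delta[state, c]
        clumps += state in first              # entering such a state opens a clump
    return clumps

def count_clumps_direct(text, words):
    """Reference count obtained by merging overlapping occurrences."""
    occ = sorted((i, i + len(w) - 1) for w in words
                 for i in range(len(text) - len(w) + 1) if text.startswith(w, i))
    clumps, last_end = 0, -1
    for start, end in occ:
        if start > last_end:
            clumps += 1
        last_end = max(last_end, end)
    return clumps

def agrees_with_direct_count(words, alphabet, max_len):
    return all(
        count_clumps_automaton("".join(t), words, alphabet)
        == count_clumps_direct("".join(t), words)
        for n in range(max_len + 1) for t in product(alphabet, repeat=n))
```

On the two examples, `build_automaton` returns the sets $X$ listed above, and the states marking a new clump are the words of $U$ themselves.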




5 Limit laws 

5.1 Normal law

A normal limit law for the number of clumps in texts of size $n$, when this number is $\Theta(n)$, follows from the automaton construction of Section 4. A Perron-Frobenius property asserts the existence of a unique dominant eigenvalue of the positive system; one next applies a suitable Cauchy integral and the large powers theorem of Hwang [?]; see [16] for details.

5.2 Poisson law for rare words 

In a Bernoulli model, if $\underline{p}$ and $\overline{p}$ are the minimal and maximal probabilities of the letters of the alphabet, words whose size is below a threshold depending on $n$ and on these probabilities have $O(n)$ occurrences in texts of size $n$ with probability one. We consider rare words with size over this threshold and a number of occurrences $O(1)$. We prove in this case a Poisson-like limit law. Taking a Taylor expansion of $O(z,u)$ in Equation (12) at $u = 0$ and considering the $k$th Taylor coefficient, with $k = O(1)$, provides a rational generating function with respect to the variable $z$ of the form

$$H_k(z) = [u^k]\, O(z,u) = \frac{R(z)U(z)\,(M(z) - K(z))^{k-1}}{(1 - K(z))^k} = \frac{\pi_w z^{|w|}\, (z - 1 + (1 - K(z))D(z))^{k-1}}{(1 - K(z))^k\, D(z)^{k+1}}. \qquad (14)$$

We follow Fayolle [?] to prove that the dominant root of the denominator of this last equation is the smallest positive root of $D(z) = \pi_w z^{|w|} + (1 - z)C(z)$ (see Equation (5)). Let $\ell = |w|$ and let $d$ be the smallest period of $w$. If $d < \ell/2$, classical results about periods of words provide $C(z) = 1 + \pi_u z^{|u|} + \dots + (\pi_u z^{|u|})^r + S(z)$ for a given word $u$ with $|u| < \ell/2$ and $r \ge 2$; moreover $S(z)$ is a polynomial of minimal degree at least $\ell/2$. Moreover, we have $K(z) = \pi_u z^{|u|} + R(z)$, where $S(z) - R(z)$ is a polynomial with positive coefficients. This entails that $|S(z)|$ and $|R(z)|$ are $o(1)$ for $|z| < 1/\overline{p}$, where $\overline{p}$ denotes the maximal letter probability. Up to negligible terms, we get

$$|C(z)| = \left| \frac{1}{1 - \pi_u z^{|u|}} \right| \ge \frac{1}{1 + \pi_u |z|^{|u|}} \ge \frac{1}{1 + \overline{p}\,|z|} \quad \text{for } |z| < \frac{1}{\overline{p}}.$$
We also have $|1 - K(z)| > 0$ and $\pi_w z^{|w|} = o(1)$ for $|z| < 1/\overline{p}$. By Rouché's theorem, in the disk $|z| < 1/\overline{p}$ the generating function $H_k(z)$ has a single pole, which is the smallest-modulus root $\rho$ of the equation $D(z) = 0$. Perron-Frobenius considerations on the automaton counting the number of occurrences of $w$ imply that this pole is real positive. A similar proof follows when $d > \ell/2$.

Writing $D(z) = Q(z)(1 - z/\rho)$ and $P(z) = z - 1 + (1 - K(z))D(z)$, we get as a first approximation

$$[z^n] H_k(z) \approx \frac{\pi_w \rho^{|w|}}{P(\rho)\, Q(\rho)} \cdot \frac{1}{k!} \left( \frac{n\, P(\rho)}{(1 - K(\rho))\, Q(\rho)} \right)^{k} \rho^{-n}.$$

A similar behaviour has been observed for occurrences of one word by Regnier and Szpankowski [18].

5.3 Length of the clumps in infinite texts 

The generating function of the size of the clumps in infinite texts is that of a sum of geometric random variables.

6 Conclusion 

An interesting application of this article would be a combinatorial analysis of tandem repeats or multiple repeats that occur in genomes; large variations of such repeats are characteristic of some genetic diseases. Would it be possible to extend our approach to clumps of regular expressions? We consider clumps of a regular expression (i.e. contiguous sets of positions such that each position is covered by at least one word of the associated regular language and such that the leading and terminating positions of each occurrence are covered by at least two occurrences). In this case the star-height theorem (CITE) implies that we cannot in general find a finite set of words $w_i$ and a finite set of prefix codes $\mathcal{K}_i$ with $1 \le i \le \ell$ such that the language $\bigcup_{1 \le i \le \ell} w_i (\mathcal{K}_i)^*$ describes the clumps.